Optimising a 3D convolutional neural network for head and neck computed tomography segmentation with limited training data

Highlights • Convolutional neural networks (CNNs) are used for auto-segmentation in radiotherapy.• However, CNNs rely on large, high-quality datasets: a scarcity in radiotherapy.• We develop a CNN model, trained with limited data, for accurate segmentation.• Multiple experiments were performed to optimise key features of our custom model.• Our model is competitive with state-of-the-art methods on a public dataset.


Introduction
The 3D segmentation of organs-at-risk (OARs) is a crucial step in the radiotherapy pathway. However, segmentation or delineation by clinicians is slow, expensive and prone to inter-and intra-observer variability even among experienced radiation oncologists [1]. Fully convolutional neural networks (CNNs) are now the state-of-the-art for automated medical image segmentation [2]. Recently, a considerable number of methods are have been proposed and implemented to perform segmentation faster and with higher consistency [3][4][5][6][7]. Cutting-edge radiotherapy workflows use auto-segmentation models to suggest contours which experienced radiographers will confirm and edit if required [8].
Supervised training of CNN models traditionally requires large amounts of high quality annotated data (often >1000s of examples) [9]. In this application full volumetric segmentation by radiographers, ideally with the same level of expertise and following the same guidelines, is needed for every image. As a result, high-quality sets of training data for auto-segmentation are often limited in size. Large institutions and commercial systems regularly use datasets containing 100s of images [10,11]. However, very few researchers have access to such large datasets. For 2D tasks, transfer learning from large, pre-trained backbone models, such as ResNet, is often used to improve performance when limited training data is available. Analogous backbone models are not yet readily accessible in 3D. This study aimed to develop a custom 3D CNN model capable of accurate auto-segmentation of head and neck (HN) OARs using a small, publicly available dataset (34 CTs) for training. The design space of CNN models is extensive and in addition to the volume of training data available, choices in the CNN architecture and training protocols can heavily impact model performance. We selected three key design elements to optimise in the development of our custom CNN.

CNN model architecture
Our base segmentation CNN was founded on the 3D UNet design [12]. This consists of an encoding pathway of repeating zero-padded 3x3x3 convolutional and pooling layers, followed by a decoding pathway of similar convolutions and up-sampling (Fig. 1). Residual skip connections were added to smooth the training process. These residual connections were implemented with 1x1x1 convolutional layers to match the channel number on either side of the convolutional block [13]. Multi-level deep supervision was introduced at each level in the decoding portion of the network to accelerate convergence. The deep supervision connections contain bottleneck 1x1x1 convolutions reducing the number of model parameters and enabling training on a single graphics processing unit (GPU).

Multiple input channels with specific contrast settings
Generally, images are pre-processed before being used as input for a CNN. Routine pre-processing consists of normalising images to have μ = 0 and σ = 1 or mapping the image onto the range [0,1].
An advantage of working with computed tomography (CT) scans is that voxel intensities are calibrated to Hounsfield units (HU), a scale of tissue density with fixed reference values at air (-1024 HU) and water (0 HU). Clinicians use windowing or grey-level mapping when visualising CT images to enhance the contrast of different tissues and highlight particular structures, for example, narrow windows are used for soft tissues with similar attenuation and wide windows for visualising bone. Image brightness is adjusted with the window level (L) and contrast is adjusted with the window width (W).
L and W define a ramp function that is used to map all intensities in a given image as shown in Fig. 2. Our proposed approach used three input channels, normalised with distinct contrast settings. The chosen W and L contrast settings are used by radiologists to specifically view soft tissue, bony anatomy and brain tissue [14]. The three distinctly contrasted CT volumes were concatenated along the "channels" axis and fed through the CNN simultaneously. This approach is analogous to separate RGB channels in 2D natural images.
We compared our proposed method with a comparison baseline that used a single input channel where the CT image is normalised with a fullwidth window. Practically this involved setting L = 488 and W = 3024, standardising the entire intensity range of the image onto the range [0, 1].

Resize convolutions
Transpose convolutions are frequently used to mimic the inverse convolution operation, increasing an image's spatial dimensions by dilating the input [15]. However, transpose convolutions can produce checkerboard artefacts in CNN-generated images [16]. Resize convolutions (up-sampling followed by a standard convolutional layer) have been proposed as a drop-in replacement to remedy such artefacts. We Fig. 1. The CNN architecture used in this study. The base model used was a 3D Res-UNet with deep supervision. In this figure we highlight the three modifications that form the presented experiments. We compared using multiple contrast settings for the model input (1), resize or transpose convolutions in the decoder portion (2) and three different loss functions (3). When using transpose convolutions (orange), we did not perform tri-linear up-sampling.
compare model configurations using traditional transpose and resize convolutions in the decoder portion of the segmentation CNN. In this work, 3D resize convolutions were implemented using tri-linear upsampling prior to a zero-padded 3x3x3 convolutional layer.

Loss functions
We performed experiments with three loss functions: a simple overlap metric; a linear combination of an overlap metric and crossentropy; and a similar combination with added non-linearities.
In segmentation tasks, overlap loss metrics have gained popularity due to their easy implementation and quick convergence. The first metric we evaluated was the multi-class weighted Soft Dice (wSD) loss function which is based on the Dice similarity coefficient (DSC). The wSD loss is given by where V is the CT volume and l is the OAR label. In 3D volumes there is often a significant class imbalance between background and labelled voxels which can be several orders of magnitude, especially for small OARs. OAR-specific weights, w l , are often added to address classimbalance. In this study, weights for each OAR were calculated with the inverse label frequency as with α = 1/3.
Cross-entropy (XE) is a popular loss function that evaluates target and prediction similarity with log-probabilities. The second metric implemented was composed of a linear combination of wSD and weighted XE (wSD + XE).
Wong et al. proposed an "Exponential Logarithmic Loss" function (ExpLogLoss) for segmentation of objects with high unbalanced object sizes. This loss function was originally designed for segmenting 3D brain MR images and is formed of a sum of logarithmic SD and weighted XE. We evaluated the impact of using an ExpLogLoss function with the suggested settings outlined in [17]. The ExpLogLoss function was calculated as where Dice l was the Dice similarity coefficient for OAR l, -ln(p l (x)) was the negative log likelihood loss and α = 0.5 for w l .

Implementation details
All our models were implemented in PyTorch 1.6.0. All network training was performed on a 16 GB NVidia Tesla V100 GPU. Individual model training took ∼5hrs. Segmentation inference took <1s per 3D CT image.
Extensive data augmentation was used to improve the robustness of the model. This was essential to prevent over-fitting when using a small training dataset. Throughout training the original CT images and gold standard segmentation masks were transformed with random sequences Fig. 2. a) The windowing ramp function to map CT image intensities. b) Contrast settings for the full-width window baseline approach. c-e) Window width and level contrast settings selected for our multiple input channels approach. of augmentations. The 3D augmentation operations include: lateral mirroring (with probability, p = 0.5); shifting of ±4 voxels maximum in each direction (p = 1, ±4 mm in-plane & ±10 mm axially); rotations between ±10 • to imitate cervical flexion, extension and rotation (p = 0.75); and volumetric scaling between 90-110% (p = 0.5). All augmentations were implemented using the numpy and scipy libraries.
The Adam optimiser was used with an initial learning rate of 10 − 2 , which was reduced by a factor of 10 each time the validation loss plateaued for 100 epochs. Models were allowed to train for up to 1000 epochs, with early stopping implemented if the validation loss failed to improve for 250 epochs. Due to the size of the 3D CNN and input CT volumes, the batch size was restricted to one. However, gradient accumulation was used to delay model parameter updates, which simulated a batch size of four.

Data and experimental setups
For model development we used a publicly-available open dataset of 34 CT images (https://github.com/deepmind/tcia-ct-scan-dataset) [11]. Each of the 34 HN CTs, with voxel resolution of 1x1x2.5 mm, have OAR delineations from two doctors. One set of delineations was treated as the gold standard and used for training. The CNN model was trained for 3D segmentation of the mandible, brainstem, parotid glands and the cervical section of the spinal cord. We performed experiments to assess every configuration of the three loss functions, multiple-vs. singlechannel contrast input and resize vs. transpose convolutions. Before segmentation the CTs were automatically cropped to anatomically consistent sub-volumes with the dimensions of 200x200x56 voxels using in-house software [18].
A 5-fold cross-validation was performed for each model configuration [19]. In each fold, a CNN model was trained from scratch using 24 training images and 3 validation images. In such a cross-validation, training data is used to adjust model parameters, whereas the validation data informs adjustments to the learning rate and when to terminate the training process. Sets of 7 testing images were held out and used to evaluate the final segmentation performance of the fold.

Segmentation performance metrics
Model segmentation performance was compared to the measured deviation between the two doctors, using both the 95th percentile Hausdorff distance (HD95) and mean distance-to-agreement (mDTA). A Wilcoxon signed-rank test of the second clinician and CNN HD95 samples was performed for each OAR with the null hypothesis that the differences of the medians are zero.
Overlap metrics such as DSC and the Jaccard index are often reported for semantic segmentations works. However, such metrics are heavily biased towards structure volume, insensitive to fine details as bulk overlap can hide clinically relevant differences between structure boundaries [3,20]. In radiotherapy, small deviations in the borders of segmentations can have a potentially serious impact, e.g. increasing the risk of side effects for the patient through unplanned irradiation of an OAR.
As such, distance metrics, such as the mDTA and HD95 [21], are preferred [22] and reported in this paper. To calculate these metrics, distance transform maps were created for the reference segmentation and sampled on the voxels on the boundary of the evaluated segmentations. We evaluated these distances symmetrically, i.e. using distance maps from the golden standard and sampling on the boundary voxels of the predicted segmentation and vice versa. These distances were then summarised by their mean (mDTA) and by their 95th percentile maximum distance (HD95). mDTA serves to assess the overall results and HD95 the worst matching region.

External validation
Our optimal model configuration was further validated on the public MICCAI Head and Neck Auto Segmentation Challenge 2015 dataset (version 1.4.1) [23]. This dataset (MICCAI'15 set) contains 48 patients which were originally divided into; 25 for training, 8 for optional additional training, 10 for offsite testing and 5 for onsite testing. We retrained our best configured model using the original set of 25 for training, the 5 onsite testing images for validation and the 10 offsite testing images for testing. The 8 samples in the original "optional training" set do not have all OARs delineated so were not included. Unfortunately, this dataset does not contain spinal cord delineations.
Our proposed model's results on the MICCAI'15 set were compared to the state-of-the-art (SOTA) results published on the same dataset by Huang et al. [4], Zhang et al. [5], Gao et al. [6], Gou et al. [7] and Kawahara et al. [27]. Each comparison method published either the HD95, the average surface distance (ASD), equivalent to mDTA, or neither of these. However, all five studies published DSC results, so we additionally calculated DSC results for our model on the MICCAI'15 set for comparison.

Model development
Descriptive results of the HD95 and mDTA metrics for every model configuration are shown in Table 1.
All models had similar HD95 performance on the spinal cord with the results consistently reflecting the CT slice thickness (2.5 mm). This suggests most models made errors in the spinal cord length by a single slice. Otherwise, across all OARs and both metrics, our consistently topperforming models were trained with the ExpLogLoss function.
To more closely examine models trained with the ExpLogLoss function, we compared the mDTA values for all such model configurations using box-plots in Fig. 3. The best performing model configuration used multiple input channels, transpose convolutions and was trained using the ExpLogLoss function. This configuration produced parotid gland, spinal cord and mandible segmentations with a similar level of accuracy to inter-clinician deviation. The only significant difference was found for the brainstem (p = 0.00008, Wilcoxon signed-rank test). However, the segmentation performance in the brainstem was still good, with a median HD95 of (3.37 ± 1.50)mm and median mDTA of (0.95 ± 0.37) mm.
Model configurations with multiple input contrast channels consistently outperformed the single input channel counterparts for segmentation performance in the soft tissue organs (brainstem, parotid glands and spinal cord). The segmentation performance was equivalent in the mandible regardless of input type.
When training with wSD and the wSD + XE combination loss functions, the transpose and resize convolutions performed very similarly. However, when training with the ExpLogLoss function, the transpose convolutions performed marginally better.

External validation
Our best-performing model configuration (three-channel input, transpose convolutions and ExpLogLoss function) was then re-trained and evaluated on the MICCAI'15 set. In Table 2 the HD95, mDTA and DSC metric results for our proposed method are presented.
In Fig. 4 we illustrate example segmentations produced by our proposed method. Fig. 4a and 4b show 2D axial and sagittal slices of a patient from our original dataset. Fig. 4c and 4d are examples of a patient from the MICCAI'15 set used for external validation.

Discussion
CNNs, which are now being used for auto-segmentation in radiotherapy planning, typically require large datasets to train effectively. We developed a CNN model capable of accurate HN CT segmentation when trained on a small dataset. This was achieved through careful tuning of a customised 3D CNN. Varying particular elements of our model provided insight into what impacts the performance of CNN auto-segmentation methods, in particular UNet-based architectures.
Using multiple contrast settings for the model input was a key strategy to improve segmentation performance for soft-tissue OARs. We compared our three pre-selected contrast channels to a baseline input using a single full-width window. We did not exhaustively compare each contrast window individually or in combinations as this would have greatly increased the number of experiments. It is possible that using just one of our contrasted channels could be sufficient in some situations (e. g. for the mandible). However, identifying these specific situations adds considerable complexity to the auto-segmentation task and using all three channels adds little computational load in both training and inference stages. Additionally, we did not optimise the contrast settings used in this study, instead relying on values sourced from literature [14]. In 2018, Lee et al. developed a window setting optimisation module that implements contrast normalisation as a learnable parameter of the model [24]. It would be interesting to discover whether a similar module could be deployed successfully within our methodology.
The loss function is a crucial component of training a deep learning model. We found the ExpLogLoss function, originally developed by Wong et al. [17], produced higher accuracy segmentation models compared to simpler soft Dice and cross-entropy combination functions. Lu et al. reported concurring results for a similar loss function when applied to 3D stroke lesion segmentation in T1 weighted MR images [25]. In future work it would be of interest to evaluate the recently introduced "Unified Focal" loss function which performs well for highly Table 1 Median values of the HD95 and meanDTA metrics for every model configuration. Lower values show closer agreement between the CNN predicted segmentations and the gold standard. In this table T and R indicate models using either Transpose or Resize convolutions in the decoder portion respectively. These results are summarised by the median and standard deviation of the metrics for each OAR across all patients in the five test set folds. The best performing configurations for each OAR are highlighted in bold font and are determined with more significant figures than shown. For the HD95, the majority of the spinal cord results reflect the CT image slice thickness (2.50 mm). This suggests most models made errors in the spinal cord length by a single slice. Model configurations trained with the ExpLogLoss function consistently produce better segmentations.  imbalanced class segmentation [26]. Unexpectedly, the checkerboard artefacts from transpose convolutions, described in Section 2.3, did not noticeably degrade segmentation performance. Resize convolutions have become fairly prevalent among published methods to avoid this issue. However, our proposed method produced better segmentations when using transpose convolutions. Additionally, models with transpose convolutions trained ∼15% quicker as a result of containing ∼1.6 million fewer parameters.
Once the development phase was complete, our best model config-uration was evaluated on a public dataset which has been used as a benchmark for several SOTA methods. The results published by Huang et al. [4], Zhang et al. [5], Gao et al. [6] and Gou et al. [7] are shown in Table 2 [28]. However, this model was trained with the MIC-CAI'15 dataset and 24 additional CT scans so we could not include their results in Table 2. Our method has been specifically developed to leverage limited data, allowing for custom models to be trained on small datasets to segment different OARs or according to an updated protocol. Protocol-specific models can then be deployed in applications such as retrospective modelling studies or clinical trials to improve consistency. A natural extension for this study would be to further evaluate how model performance changes as the size of the training set changes. Siciarz [29]. However the fewest number of training samples considered by Siciarz et al. was still almost twice the size of the dataset used in this study.
In this study, we showed that through careful tuning and customisation a 3D CNN can be trained with a small dataset to segment the mandible, parotid glands and spinal cord with an accuracy that is similar to the magnitude of inter-clinician deviation. We evaluated our proposed model on a popular public dataset and produced high-quality segmentation results that were competitive with current state-of-theart methods in multiple metrics.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.